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ABSTRACT 


We  study  parallel  two-level  overlapping  Schwarz  algorithms  for  solving  nonlinear  finite  element 
problems,  in  particular,  for  the  full  potential  equation  of  aerodynamics  discretized  in  two  dimensions 
with  bilinear  elements.  The  overall  algorithm,  Newton-Krylov-Schwarz  (NKS),  employs  an  inexact 
finite-difference  Newton  method  and  a  Krylov  space  iterative  method,  with  a  two-level  overlapping 
Schwarz  method  as  a  preconditioner.  We  demonstrate  that  NKS,  combined  with  a  density  up  winding 
continuation  strategy  for  problems  with  weak  shocks,  is  robust  and  economical  for  this  class  of 
mixed  elliptic-hyperbolic  nonlinear  partial  differential  equations,  with  proper  specification  of  several 
parameters.  We  study  upwinding  parameters,  inner  convergence  tolerance,  coarse  grid  density, 
subdomain  overlap,  and  the  level  of  fill-in  in  the  incomplete  factorization,  and  report  their  effect  on 
numerical  convergence  rate,  overall  execution  time,  and  parallel  eiBiciency  on  a  distributed-memory 
parallel  computer. 
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1  Introduction 


In  the  past  few  years  domain  decomposition  methods  for  linear  partial  differential  equations, 
including  overlapping  Schwarz  methods  [9,  12,  13,  37],  have  graduated  from  theory  into 
practice  in  many  applications  [17,  27,  28,  34].  In  this  paper,  we  study  several  aspects  of  the 
parallel  implementation  of  a  Krylov- Schwarz  domain  decomposition  algorithm  for  the  finite 
element  solution  of  the  nonlinear  fuU  potential  equation  of  aerodynamics,  extending  our 
model  studies  of  linear  convection-diffusion  problems  in  [5]  and  of  linear  aerodynamic  design 
optimization  problems  in  [33].  Newton-Krylov  methods  [2,  3, 14, 15,  39]  are  potentially  well 
suited  and  increasingly  popular  for  the  implicit  solution  of  nonlinear  problems  whenever  it 
is  expensive  to  compute  or  store  a  true  Jacobian.  We  employ  a  combined  algorithm,  called 
Newton-Krylov-Schwaxz,  and  focus  on  the  interplay  of  the  three  nested  components  of  the 
algorithm,  since  the  amount  of  work  done  in  each  component  affects  and  is  affected  by  the 
work  done  in  the  others. 

Newton-Krylov-Schwarz  is  a  general  purpose  parallel  solver  for  nonlinear  partial  differ¬ 
ential  equations  and  has  been  applied  to  complex  multicomponent  systems  of  compressible 
and  reacting  flows  in,  e.g.,  [6,  7, 29].  This  paper  is  concerned  with  the  simpler  scalar  problem 
of  the  full  potential  equation,  which  describes  inviscid,  irrotational,  isentropic  compressible 
flow.  Though  the  fuU  potential  model  is  highly  idealized,  it  remains  the  model  of  choice 
of  external  aerodynamic  designers  to  date,  because  codes  based  thereupon  offer  reasonable 
turnaround  times  and  in  many  cases  high  accuracy  compared  to  state-of-the-art  Navier- 
Stokes  solvers.  Though  derived  under  the  condition  of  isentropy,  the  full  potential  model 
remains  useful  in  flows  with  weak  shocks,  with  pre-shock  Mach  numbers  of  about  1.2  or 
less.  It  can  also  be  extended  by  boundary  layer  patching  to  incorporate  viscous  effects,  by 
a  branch  cut  to  accommodate  lift,  and  by  source  terms  to  simulate  powered  engines.  In 
engineering  practice,  accurately  modeling  such  nonideal  effects  in  complex  geometries  ac¬ 
counts  for  almost  aU  of  the  lines  of  code,  but  the  solution  of  the  resulting  discrete  equations 
accounts  for  the  majority  of  the  execution  time.  The  lower  per-ceU  storage  and  computa¬ 
tional  requirements  of  the  potential  model  allow  the  use  of  grids  dense  enough  to  achieve  low 
truncation  error  levels  for  complex  geometries.  The  full  potential  equation  also  avoids  the 
spurious  entropy  generation  near  stagnation  often  associated  with  Euler  and  Navier-Stokes 
codes  for  industrial  complex  geometries  of  interest.  We  justify  the  simply  coded  examples 
in  this  paper  by  our  focus  on  a  solution  algorithm  that  should  not  require  any  changes  other 
than  greater  irregularity  in  its  sparse  data  structures  to  be  useful  in  more  practical  settings. 

With  Newton’s  method  as  the  outer  iteration,  a  highly  nonsymmetric  and/or  indefinite 
large,  sparse  Jacobian  equation  needs  to  be  solved  at  every  iteration  to  a  certain  accu¬ 
racy,  which  is  often  progressively  tightened  in  response  to  a  falling  nonlinear  residual  norm. 
The  most  popular  family  of  preconditioners  for  large  sparse  Jacobians  on  structured  or  un¬ 
structured  grids,  incomplete  factorization,  is  difiicult  to  parallelize  efficiently  flop-for-flop  in 
its  global  form.  In  our  approach,  the  ILU-preconditioner  for  the  Newton  correction  equa¬ 
tions  is  replaced  by  a  multi-level  overlapping  Schwarz  preconditioner.  The  latter  is  not 
only  scalably  paraUelizable  (up  to  available  granularities),  but  also  possesses  an  asymptoti¬ 
cally  optimal  mesh-  and  granularity-independent  convergence  rate  for  elliptically  dominated 
problems.  Our  two-level  overlapping  additive  Schwarz  algorithm  uses  a  non- nested  coarse 
space.  Subdomain  granularity,  quality  of  subdomain  solves,  coarse  grid  density,  strategy  for 
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coarse  grid  solution,  and  inner  iteration  termination  criteria  are  important  factors  in  overall 
performance.  We  report  numerical  experiments  on  an  IBM  SP2  with  up  to  32  processors. 

The  outline  of  this  paper  is  as  follows.  In  §2,  we  briefly  derive  the  form  of  the  full 
potential  equation  that  serves  as  the  point  of  departure  for  the  numerics.  The  finite  ele¬ 
ment  discretization  and  the  construction  of  an  approximate  Jacobian  for  the  full  potential 
equation  are  discussed  in  §3.  §4  is  devoted  to  the  description  of  the  basic  components  of 
the  NKS  algorithm.  Several  parallel  implementation  issues  are  explained  in  §5.  Numerical 
results  are  summarized  in  §6.  Finally,  we  offer  some  general  remarks  on  the  use  of  NKS 
algorithms  in  §7. 


2  The  full  potential  problem 


For  complet.eness,  we  summarize  the  derivation  and  assumptions  of  the  full  potential  equa¬ 
tion  of  aerodynamics.  For  a  more  thorough  development,  see  [23]. 

The  equation  of  mass  conservation  in  a  steady  state  fluid  flow  can  be  written  in  diver¬ 
gence  form, 

V  •  {pv)  =  0,  (1) 

where  v  =  (vi,U2)^  is  the  velocity  and  p  is  the  local  density,  respectively.  We  assume  that 
the  flow  is  irrotational,  which  implies  that  there  exists  a  velocity  potential  $  such  that 
V  =  V$.  Furthermore,  the  relation  ^  =  const,  holds  for  isentropic  flow  of  a  perfect  gas. 
With  the  above  assumptions,  we  can  integrate  the  inviscid  momentum  equations  and  obtain 
Bernoulli’s  equation 

—  -\ — - —  =  const..  (2) 
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where  ?  =  (v?  +  is  the  local  flow  speed.  The  sound  speed  a  is  defined  by 

=  dpi  dp.,  where  p  is  the  local  static  pressure.  By  means  of  the  above  relations,  the  five 
unknown  fields  ui,  V2,p,  a,  and  p  can  be  eliminated  in  favor  of  a  single  unknown  function  $, 
which  solves  the  full  potential  equation: 

V  •  (p($)V$)  =  0.  (3) 


Two  forms  of  this  equation  are  standard  in  the  literature,  depending  upon  whether  the 
density  is  referenced  to  a  uniform  freestream  (at  oo)  or  to  a  stagnation  point  condition.  We 
derive  the  freestream  version  as  follows.  From  Bernoulli’s  equation  (2). 


we  have  that 
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where  M  =  q/a  is  the  Mach  number,  and  Moo  is  the  freestream  Mach  number.  From  the 
definition  of  the  sound  speed  and  the  pressure-density  relation,  we  obtain 
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dp 


(6) 
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or  equivalently,  =  {-^V  Therefore, 

/)($)  =  Poo  In - 2  )l  •  vv 

Observe  that  while  the  density  is  positive  in  regions  of  validity,  (3)  may  be  locally  hyperbolic. 

Equation  (3)  requires  boundary  conditions.  In  this  paper,  we  consider  only  subsonic 
farfield  boundaries.  Since  our  emphasis  is  on  the  performance  of  the  Schwarz  precondi¬ 
tioning,  we  study  a  symmetric  nonlifting  case,  thus  avoiding  consideration  of  the  Kutta- 
Joukowsky  boundary  condition.  To  keep  the  geometry  of  the  domain  trivial,  we  use  a 
classical  transpiration  boundary  condition  on  a  slit  to  represent  the  airfoil.  Transpiration 
refers  to  a  continuously  parameterized  injection  and  removal  of  fluid  along  a  portion  of 
the  boundary  to  create  a  recirculation  pocket  with  a  bounding  streamline  attached  to  the 
domain  boundary  at  both  ends,  over  which  the  flow  of  interest  passes  inviscidly.  Transpira¬ 
tion  is  implemented  as  an  inhomogeneous  Neumanii  condition.  A  theoretical  discussion  of 
the  use  of  transpiration  boundary  conditions  to  model  displaced  surfaces  can  be  found  in 
[2.5].  For  the  farfield  boundary  condition,  we  use  Dirichlet  values  of  the  potential  upstream. 
More  sophisticated  farfield  conditions  are  possible,  and  are  required  in  the  case  of  a  lifting 
airfoil,  but  these  conditions  are  sufficient  for  excellent  agreement  of  our  numerical  results 
with  standard  nonlifting  solutions. 


3  Finite  element  approximation 

Following  Boeing’s  TRANAIR  code  [42],  we  employ  a  finite  element  formulation  of  the 
two-dimensional  full  potential  equation  using  bilinear  elements.  The  existence,  uniqueness, 
and  regularity  of  the  solution  are  not  central  to  this  paper,  but  have  been  discussed  in  the 
papers  [31,  35]  and  references  therein.  Related  finite  element  approaches  for  this  class  of 
full  potential  equations  can  also  be  found  in  [1,  16].  A  finite  volume  scheme  was  given  in 
[30]  and  a  mortar  element-based  domain  decomposition  scheme  was  recently  parallelized  to 
high  efiiciency  in  [26]. 


3.1  Basic  finite  element  scheme 

The  finite  element  problem  is  formulated  in  terms  of  the  weak  form 

a(#,t;)=  f  /9($)V$  •  Vu  dfi. 


We  use  bilinear  elements  on  a  rectangular  partition  of  Q  denoted  by  9,h  =  {r,-,  ^  =  1, . . . ,  Mh}- 
Let  {<i>i{x,y)}  be  the  usual  nodal  basis  functions.  The  numerical  solution  we  seek  has  the 
form 

^x,y)  = 

and  satisfies  the  following  nonlinear  algebraic  equations 


V$V?;  dCl  =  0, 


(8) 
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for  all  V  in  the  test  function  space.  Here  is  the  center  point  of  the  rectangle  n. 

To  simplify  and  speed  up  numerical  integration,  we  introduce  certain  approximations  when 
dealing  with  some  of  the  nonlinear  forms.  The  way  that  we  treat  the  local  nonlinear 
numerical  integration  in  (8)  is  like  that  in  [42].  Let  us  define  a  system  of  nonUnear  equations 


F(#) 


Ti($i ,  •  •  • ,  $jv,  p($i,  •  •  •  5  ^iv)) 


=  0, 


(9) 


where 


/;•($, p($))=  Y.  /  v^V(j>idn, 

T€Qh 


(10) 


and  {x^,  Vj)  is  the  center  point  of  the  element  r.  We  construct  the  Jacobian  matrix  J  -  {Jij} 
of  the  noidinear  system  F  =  0,  approximately,  as  Mows.  For  each  pair  of  indices  i,j,  we 
define 


+ 


S  / 

Jr 


TgOft 


dp  d 

dlWj 


where  s  =  ||V#||1.  To  simplify  the  numerical  integration,  the  exact  Jacobian  value  above 
is  replaced  by 

Ji,  =  Y.  '>(«(*-  Vr))  + 


re^h 


re^h 


(11) 


where  the  value  of  dp/ds  is  calculated  at  the  element  center  point.  We  remark  here  that, 
because  the  density  function  p  is  not  a  constant,  the  Jacobian  matrix  is  generally  non- 
symmetric  and  possibly  indefinite.  The  explicit  construction  of  the  Jacobian  matrix  is  not 
necessary  if  we  use  an  unpreconditioned  Newton-Krylov  method;  however,  to  implement  a 
Schwarz  preconditioner,  explicit  approximation  of  the  Jacobian  matrix  is  needed  in  each 
subdomain. 


3.2  Density  upwinding  schemes 

For  subsonic  problems,  the  above  mentioned  finite  element  method  is  sufficient;  however, 
for  transonic  cases  upwinding  has  to  be  introduced  in  the  density  calculation  in  order  to 
capture  the  weak  shock  in  the  solution.  The  proper  use  of  an  upwinding  scheme  is  essential 
both  to  the  success  of  the  overall  approach  in  finding  the  correct  location  and  strength  of 
the  shock  and  to  the  convergence,  or  the  fast  convergence,  of  the  inexact  Newton’s  method. 

As  mentioned  earlier,  density  p  is  assumed  to  be  a  constant  in  each  element,  and  this 
constant  is  ordinarily  determined  by  the  four  values  of  $  at  the  corners  of  the  element, 
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Figure  1:  The  finite  element  stencil.  $  is  stored  at  the  cell  vertices  and  p  at  the  cell  centers. 

through  (7).  Following  [24,  42],  if  an  element  is  determined  to  be  supersonic,  or  nearly  so, 
its  density  value  is  replaced  by 

p  =  p-liv-w.p,  (12) 

where  V  is  the  normalized  element  velocity  and  V-p  is  an  upwind  undivided  difference. 
For  example,  with  reference  to  Fig.  l,iiV  =  (F^,  Vy)  and  V^,  Fj,  >  0  in  the  element  marked 
with  pi,  then 

h  =  Pi  ~  piyxipi  —  Pe)  +  VyiPi  —  Ps))' 

Here  p  is  the  element  switching  function, 

/X  =  i/Q  max{0, 1  —  (13) 

where  M  is  the  element  Mach  number,  Me  is  a  pre-selected  cutoff  Mach  number  chosen  to 
introduce  dissipation  just  below  Mach  1.0,  and  i/q  is  a  constant  usually  set  to  something 
between  1.0  and  3.0  to  increase  the  amount  dissipation  in  the  supersonic  elements.  Pa¬ 
rameters  Me  and  i/q  may  be  varied  to  advantage  between  Newton  steps  in  problems  with 
shocks.  Roughly  speaking.  Me  controls  the  spatial  extent  of  the  upwinding;  as  it  drops 
below  1.0  upwinding  is  triggered  in  a  greater  number  of  subsonic  (but  nearly  sonic)  cells. 
Me,  t'o,  and  F  together  control  the  amount  of  the  upwinding  in  a  triggered  cell.  A  low 
Me  and  a  high  i>o  stabilize  convergence  but  diffuse  the  shock.  As  iterations  progress,  Me 
should  approach  1.0  and  vq  should  be  decreased  to  steepen  up  a  shock  whose  location  and 
strength  has  converged.  A  weU  resolved  shock  will  take  many  Newton  steps  to  settle  on  its 
correct  location,  whereas  a  diffuse  shock  centers  quickly  on  this  location.  A  carefully  chosen 
sequence  of  Me  and  i/q  can  considerably  accelerate  the  Newton’s  convergence;  more  details 
on  this  “viscosity  damping”  can  be  found  in  [41].  Another  way  to  control  the  convergence 
of  Newton’s  method  is  through  the  use  of  iterated  maximization  of  the  switching  function, 
as  described  below  and  apparently  first  discussed  in  [22]. 

For  each  element,  p  as  defined  by  (13)  is  called  the  zeroth  level  switching  function  and 
is  denoted  more  precisely  as  In  Fig.  1,  there  is  a  nonzero  pf’^  for  each  element  marked 
with  pi.  The  first  level  switching  function  for  the  element  marked  with  pi  is  defined  as 
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namely,  /xf  ^  is  the  maximum  of  the  all  the  n  values  in  its  immediate  neighborhood.  A 
{k  +  l)-level  switching  function  is  defined  recursively  as  the  maximum  of  the  neighboring 
k-level  fx  values. 

Results  ioT  k  =  2  are  reported  in  §6.  A  rather  tight  Mach  cutoff  is  used,  namely 
=  0.95,  and  we  set  uq  to  1.0. 

We  remark  that  large  k  results  in  greater  discrete  data  dependency,  or  larger  effective 
stencil  size,  in  both  the  nonhnear  function  and  the  Jacobian.  For  example,  il  k  =  0,  the 
stencil  contains  at  most  9  points  the  nine  mesh  points  immediately  surrounding 
in  Fig.  1).  If  A;  =  1,  then  some  of  the  “x”  points  may  join  the  stencil  depending  the 
flow  direction,  and  the  stencil  may  contain  as  many  as  16  points.  The  increase  in  the 
stencil  bandwidth  does  not  cause  much  of  a  problem  in  the  nonlinear  function  evaluation, 
but  would  substantially  increase  the  memory  requirement  of  the  Jacobian  matrix,  wWch 
is  constructed  and  stored  for  the  Schwarz  preconditioner  at  the  beginning  of  each  Newton 
iteration.  To  keep  the  memory  requirements  small  in  practice,  we  do  not  calculate  or  store 
the  matrix  elements  introduced  by  using  the  iterated  switching  function.  Our  numerical 
experiments  show  that  this  extra  level  of  approximation  of  the  Jacobian  matrix  does  not,  in 
fact,  appreciably  reduce  its  power  as  a  preconditioner.  This  is  analogous  to  the  practice  in 
[7]  of  using  Jacobian  blocks  based  on  first-order  upwinding  to  drive  a  second-order  upwinded 
residual  to  zero,  in  an  inexact  Newton  iteration.  Though  not  much  discussed  in  the  theory  of 
approximate  Newton  methods  for  systems  arising  from  PDFs,  such  techniques  are  commonly 
applied  in  stationary  iterations  in  steady-state  aerodynamics  codes.  Especially  in  three 
space  dimensions,  using  simplified  upwinding  in  the  Jacobian  matrix  dramatically  reduces 
cost  at  a  small  expense  in  convergence  rate  degradation. 


4  Newton-Krylov-Schwarz  algorithms 

NKS  is  a  family  of  general  purpose  algorithms  for  solving  nonlinear  boundary  value  prob¬ 
lems  of  partial  differential  equations.  In  terms  of  software  development,  NKS  has  three 
components  that  can  be  handled  independently.  However,  to  achieve  reasonable  overall 
convergence,  the  three  components  have  to  be  tuned  simultaneously.  We  discuss  these 
components  in  turn. 

4.1  The  matrix-free  Newton  method 

In  this  subsection,  we  briefly  discuss  the  well  known  matrix-free  inexact  finite-difference 
Newton  algorithm,  and  the  Eisenstat- Walker  forcing  functions  [15].  Starting  from  an  initial 
guess  $0,  which  is  sufficiently  close  to  the  solution,  a  solution  of  the  nonlinear  system  (9) 
is  sought  by  using  an  inexact  Newton  method:  For  some  Tjk  €  [0,1)  find  that  satisfies 

||F($fc)  -b  J{^k)sk\\  <  rik  (14) 

and  set  +  hsk,  where  Afc  G  (0,1)  is  determined  by  a  line  search  procedure 

[11].  In  practice,  the  method  is  insensitive  to  the  details  of  the  method  used  to  determine 
Afc.  Much  more  important  is  nonlinear  continuation  in  grid  density,  dissipation,  and  other 
parameters.  The  iteration  is  continued  until  convergence,  typically  defined  in  terms  of  a 
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sufficiently  small  ||F(#fc)||.  The  vector  s*  is  obtained  by  approximately  solving  the  linear 
Jacobian  system 

with  a  Krylov  space  iterative  method.  The  action  of  Jacobian  J  on  an  arbitrary  Krylov 
vector  w  can  be  approximated  by 

J{^k)w  «  ^  +  €W)  -  F{^k))  ■ 

Finite-dilferencing  with  e  makes  such  matrix-free  methods  potentially  more  susceptible  to 
finite  word-length  effects  than  ordinary  Krylov  methods.  Left  preconditioning  of  the  Jaco¬ 
bian  with  an  operator  can  be  accommodated  via 

^  {B-^F((^k  +  €w))  -  #($fc))  , 

where  F{^k)  =  B~^F{^k)  is  stored  once,  and  right  preconditioning  via 

J{^k)B-'^w  «  i  (F(($fc  -I-  eB-^w))  -  F{^k))  •  (15) 

Right  preconditioning  is  preferable  when  the  focus  is  on  comparing  different  preconditioners 
in  vitro,  since  the  true  linear  residual  norm  that  is  measured  as  a  by-product  in  Krylov 
method  GMRES  (see  next  subsection)  and  used  in  the  termination  test  is  independent 
of  any  right  preconditioning.  On  the  other  hand,  any  left  preconditioning  changes  this 
by-product  residual  norm.  For  this  very  reason,  left  preconditioning  may  be  preferable 
when  GMRES  is  applied  in  vivo  as  the  solver  for  an  inexact  Newton  method.  When  the 
preconditioning  B~^  is  of  high  quality,  the  left-preconditioned  residual  serves  as  an  estimate 
of  the  error  in  the  Newton  update  vector.  This  estimate  can  be  employed  in  a  termination 
condition.  In  this  paper,  one  of  our  emphases  is  assessing  preconditioner  quality,  and  we 
report  only  right-preconditioned  results. 

The  most  expensive  component  of  the  algorithm  is  the  solution  of  the  linear  system  with 
the  Jacobian  at  each  Newton  iteration.  As  discussed  in  Eisenstat  and  Walker  [15],  when 
is  far  from  the  solution,  the  local  linear  model  used  in  deriving  the  Newton  method  may 
disagree  considerably  with  the  nonlinear  function  itself,  and  it  is  unproductive  to  “over¬ 
solve”  these  linear  systems.  We  tested  several  stopping  conditions,  including  those  discussed 
in  [15],  and  found  that  the  best  choice  for  our  problems,  based  on  elapsed  execution  time  for 
a  fixed  relative  nonlinear  residual  norm  reduction,  is  simply  to  set  T]k  =  10“^||F($jk)||2-  In 
fact,  even  the  looser  rjk  =  10“^||F($fc)||2  is  sufficient  for  the  first  few  Newton  iterations,  but 
not  much  time  is  saved  by  switching  dynamically  among  these  two  already  loose  criteria, 
so  we  use  the  first  throughout. 

4.2  Krylov  iterative  methods 

We  use  the  GMRES  method  [36],  to  solve  the  linear  system  of  algebraic  equations 

Px  =  b,  (16) 
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where  P  is  the  matrix  appears  in  (15),  and  b  is  the  negative  of  the  nonlinear  Newton 
residual  vector  in  (9).  The  method  begins  with  an  initial  approximate  solution  xq  £  R  and 
an  initial  residual  tq  =  b  -  Pxq.  At  the  iteration,  a  correction  vector  z,n  is  computed 
in  the  Krylov  subspace 

ICmiro)  =  span{ro,Pro,---,P'^~^ro} 

that  minimizes  the  residual,  min2gx;m{j-o)  11^  ~  iterate  is  thus  Xm  = 

xq  -f  Zyn-  To  fit  the  available  memory,  one  is  sometimes  forced  to  use  the  A;-step  restarted 
GMRES  method  [36].  However,  in  this  case  neither  an  optimal  convergence  property  nor 
even  convergence  is  guaranteed.  In  our  experiments,  we  do  not  need  to  solve  the  linear 
systems  very  accurately;  i.e.,  rj  =  10“^  in 

||6-Pa;„»||a<7/||ro||2 

is  sufficient  to  capture  an  accurate  solution  to  the  nonlinear  problem,  in  both  subsonic  and 
transonic  cases.  We  do  observe  that,  for  certain  maximum  Krylov  subspace  dimensions 
(for  example  30,  in  a  problem  with  approximately  10^  times  as  many  discrete  unknowns) 
and  certain  Mach  numbers  (Moo  =  0.8),  the  restarted  GMRES  can  never  reduce  the  initial 
residual  below  10"®.  In  other  words,  there  is  no  linear  convergence.  It  is  further  noticed 
in  such  cases  that  the  residual  norm  measured  as  a  by-product  in  GMRES  is  no  longer  the 
same  as,  or  even  close  to,  the  true  residual  norm  except  at  the  restarting  points,  where  it 
is  freshly  updated."*  A  loose  linear  convergence  tolerance  avoids  this  problem  by  returning 
to  the  Newton  method  with  a  step  that  is  far  from  exact.  In  the  delicate  balance  between 
few  nearly  exact  Newton  steps  with  expensive  inner  linear  solutions  and  many  inexact 
Newton  steps  with  bounded-cost  inner  linear  solutions,  we  find  the  bottom  line  of  overall 
execution  time  best  served  by  bounding  the  inner  linear  work.  This  approach  is  also  found 
most  effective  in  the  context  of  inviscid  aerodynamics  based  on  the  primitive  variable  Euler 
equations  in  [7].  It  deprives  Newton’s  method  of  its  asymptotic  quadratic  convergence,  but 
provides  steep  linear  convergence. 

4.3  Two-level  overlapping  Schwarz  preconditioners  with  non-nested  coarse 
spaces 

In  this  subsection,  we  discuss  a  two-level  overlapping  Schwarz  preconditioner  with  inexact 
subdomain  solvers  and  non-nested  coarse  grid.  Let  Q  be  the  domain  of  the  full  potential 
equation.  We  first  partition  the  domain  into  nonoverlapping  substructures  Cli,  i  =  1,  •  •  • ,  A. 
To  obtain  an  overlapping  decomposition  of  the  domain,  we  extend  each  subregion  ft;  to  a 
larger  region  ft-,  i.e.,  ft;  C  ft;.  Only  simple  box  decomposition  is  considered  in  this  paper: 
aU  the  subdomains  ft;  and  ft;  are  rectangular  and  are  made  up  of  integral  numbers  of  fine 
mesh  cells.  For  simplicity,  we  also  assume  that  all  substructures  are  of  the  same  size.  More 
precisely,  the  size  of  ft;,  i  =  1,  •  •  •,  A,  is  x  Ay  and  the  size  of  ft  -  is  x  A^,  where  the 

^We  believe,  after  Saad  (personal  communication),  that  this  may  be  due  to  a  lack  of  floating  point 
commutativity  in  the  product  that  expresses  arm  in  GMRES,  namely  arm  =  PVmyy  where  Vm  is  a  Gram- 
Schmidt  basis  for  Km  and  ^  is  a  coefficient  vector  of  dimension  m  that  satisfies  a  related  least  squares 
problem  (see  [36]).  The  effect  seems  related  to  drastic  variations  in  the  magnitude  of  successive  elements  of 

y- 
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H'  are  chosen  so  as  to  ensure  a  discrete  overlap,  denoted  by  ovlp,  which  is  uniform  in  the 
number  of  fine  grid  cells  aU  around  the  perimeter,  i.e., 

ovlp  =  {Hi  -  H,)/2  =  iH'y  -  Hy)/2, 

for  interior  subdomains.  For  boundary  subdomains,  we  simply  cut  off  the  part  that  is 
outside  fi.  Fig.  3,  which  appears  later  with  the  definition  of  numerical  boundary  conditions, 
illustrates  a  decomposition  with  an  overlap  of  three  fine  mesh  cells. 

On  each  extended  subdomain  we  construct  a  so-called  subdomain  preconditioner 
Bi  =  {Jij},  where  the  node  indexed  by  (i,  j)  belongs  to  the  interior  of  fi-.  is  calculated 
by  using  the  formula  (11).  The  density  upwinding  discussed  earlier  is  used  in  the  transonic 
cases.  Homogeneous  Dirichlet  BCs  are  used  on  the  internal  subdomain  boundary  dfi,-  n  ft, 
and  the  appropriate  external  boundary  condition  is  used  on  the  physical  boundary  if  present. 

We  next  discuss  the  construction  of  the  coarse  grid  and  the  coarse  grid  preconditioner. 
The  coarse  grid  is  built  independently  of  the  fine  mesh.  We  cover  ft  with  another  uniform 
rectangular  mesh  ft^  =  {r/^,i  =  1,...,Mh},  and  at  each  coarse  node  we  introduce  a 
bilinear  finite  element  basis  function  ^i{x,y).  The  set  of  coarse  nodes  is  not  generally  a 
subset  of  the  fine  mesh  nodes.  In  other  words,  the  discrete  subspaces  defined  by  the  two 
meshes  are  generally  non-nested  [4].  Both  coarse  and  fine  grids  cover  the  entire  ft,  and 
they  share  the  same  boundary,  which  they  both  resolve  exactly  because  of  its  prescribed 
simplicity.  (The  case  of  a  multi-level  Schwarz  preconditioner  for  geometrically  complex 
grids,  in  which  only  the  finest  level  exactly  resolves  the  boundary  geometry,  is  considered 
in  [10].)  The  coarse  grid  preconditioning  matrix  Bq  is  defined  by  using  formula  (11)  with 
respect  to  the  basis  functions  {$»}.  The  coarse  grid  matrix  arises  from  an  independent 
discretization,  not  an  agglomeration  of  fine  grid  ihatrix.  No  upwinding  is  used  on  the 
coarse  grid  even  in  the  transonic  case.  Empirically,  the  convergence  may  be  slowed  down 
if  the  density  upwinding  is  used  at  the  coarse  grid,  since  a  poorly  located  shock  may  be 
“resolved”  and  added  to  the  fine  grid  solution.  We  do  not  fully  understand  the  reason  for 
this  slowdown,  and  believe  we  are  not  alone  in  regarding  the  choice  of  a  coarse  grid  operator 
for  mixed  elliptic-hyperbolic  problems  as  one  of  the  most  important  outstanding  questions 
in  multilevel  preconditioning. 

The  interaction  of  the  coarse  and  the  subdomain  preconditioners  is  through  the  inter¬ 
polation  and  restriction  operations.  We  define  the  coarse-to-fine  interpolation  matrix,  1^, 
as  follows.  Let  =  {Uj}  be  an  Mh  x  Mh  matrix,  and 

h,j  —  ’®'i(^i)) 

where  x,  is  ith  fine  mesh  node.  The  fine-to-coarse  restriction  matrix  is  defined  as  (/^)^, 
the  transpose  of  ij^.  The  additive  Schwarz  preconditioner  can  be  written  as 

B-^  =  +  hB^\hf  -K  •  •  •  +  (17) 

Let  n[  be  the  total  number  of  nodes  in  ft^,  then  is  an  Mh  x  n-  extension  matrix  that 
extends  each  vector  defined  on  ft^  to  a  vector  defined  on  the  entire  fine  mesh  by  padding 
an  n\  x  n]  identity  matrix  with  zero  rows. 

Various  inexact  additive  Schwarz  preconditioners  can  be  constructed  by  replacing  the 
matrices  H,,  i  >  0,  in  (17)  with  convenient  and  inexpensive  to  compute  matrices,  such  as 
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those  obtained  by  using  local  incomplete  factorizations.  The  coarse  grid  operator  Bq  ^  is 
always  applied  exactly.  Some  detailed  comparisons  of  (17)  with  global  ILU  preconditioners 
on  rather  general  scalar  problems  can  be  found  in  [5].  Experience  with  transonic  potential 
problems  in  the  Boeing  TRANAIR  code  can  be  found  in  [40]. 

5  Parallel  implementation  issues 

We  implemented  the  family  of  NKS  algorithms  on  the  IBM  SP2.  The  top-level  message- 
passing  calls  are  implemented  through  the  Chameleon  Package  of  Gropp  and  Smith  [19], 
which  uses  the  IBM  MPL  library. 

The  code  is  written  in  a  hostless  manner.  Each  processor  is  assigned  one  subdomain, 
and  the  information  pertaining  to  the  interior  of  the  subdomain  is  uniquely  owned  by  that 
processor  and  is  not  available  to  any  other  processors  except  by  message  passing.  FoUowing 
the  parallel  complexity  study  in  [18],  the  low-storage  coarse  mesh  information  is  duplicated 
in  each  of  the  processors.  On  each  processor,  we  store  the  subvectors  and  subblocks  of 
the  Jacobian  matrix  associated  with  an  extended  subdomain.  For  the  coarse-grid  precondi¬ 
tioner,  the  right-hand  vector  is  built  by  a  parallel  fine-to-coarse  restriction  operation.  Once 
the  right-hand  vector  is  obtained,  the  coarse  linear  system  is  solved  simultaneously  on  all 
of  the  processors.  The  solution  is  then  added  to  the  local  subdomain  solutions  by  using  a 
parallel  coarse-to-fine  interpolation  operation.  In  all  the  experiments  that  we  have  done, 
the  size  of  the  coarse  linear  system  is  so  small  that  the  CPU  time  spent  on  it  is  negligible. 
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Figure  2:  Illustration  of  two-way  buffer  copies  required  at  each  nearest-neighbor  boundary. 
For  each  action  of  the  Schwarz  preconditioner  on  a  vector  the  data  needed  in  the  extended 
regions  are  copied  from  the  interior  of  neighboring  subdomains.  The  amount  of  data  moved 
for  each  processor  is  proportional  to  the  area  of  overlap. 

The  multiplication  of  a  vector  with  the  Schwarz  preconditioner  is  the  most  expensive 
operation  in  terms  of  memory  consumption  and  execution  time.  At  the  beginiung  of  each 
nonlinear  iteration,  the  ^-dependent  local  and  coarse  grid  preconditioning  matrices  are 
computed  explicitly,  and  stored  in  Compressed  Sparse  Row  (CSR)  format.  According  to 
the  desired  type  of  local  solver  (see  below),  the  matrices  are  factored,  and  the  upper  and 
lower  triangular  parts  stored.  The  matrices  for  the  interpolation  and  restriction  between 
the  coarse  and  fine  meshes  are  independent  of  $,  and  are  calculated  in  a  preprocessing 
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step.  After  the  solution  of  each  subproblem  is  obtained,  those  portions  that  lie  within 
the  overlapping  regions  (bounded  by  the  dashed  boxes  in  Fig.  2)  are  sent  to  neighboring 
subdomains  to  complete  the  summation  defined  in  (17).  The  length  of  the  message  is 
proportional  to  the  area  of  overlap. 


Figure  3:  Domain  D  with  an  exaggerated  NACA  0012  curve  at  the  bottom.  The  dashed  lines 
indicate  the  partition  of  the  domain  into  nonoverlapping  substructures,  and  the  dotted  lines 
indicate  the  overlapping  subdomains.  The  incomplete  fine  mesh  of  solid  lines  illustrates  an 
overlap  of  3  subintervals.  Fe  is  the  inflow,  Fs  the  freestream,  and  F4  the  outflow  boundary. 


6  Numerical  results 

In  this  section,  we  report  some  numerical  results  obtained  on  the  IBM  SP2  with  up  to  32 
processors  for  both  subsonic  and  transonic  flows.  The  SP2  offers  subsets  of  dedicated  nodes 
through  a  batch  scheduler.  Other  jobs  on  different  dedicated  subsets  share  the  communica¬ 
tion  network,  but  processor  allocation  tends  to  concentrate  intercommunicating  processors 
onto  independent  subnetworks.  We  report  five  performance  metrics  for  each  run:  (1)  the 
total  number  of  Newton  iterations;  (2)  the  total  number  of  GMRES  iterations;  (3)  the  total 
execution  time  (including  the  pre-processing  step  such  as  the  decomposition  of  the  mesh,  the 
calculation  of  message  lengths  and  the  allocation  of  sparse  matrices,  aU  communication  and 
synchronization  overhead,  etc.),  which  is  an  average  over  all  processors;  (4)  the  megaflop 
rate,  which  is  a  sum  of  the  rates  on  each  processor;  and  (5)  the  total  communication  time, 
which  is  an  average  over  all  processors  (isolated  out  of  (3)).  Metrics  (1)  and  (2)  are  of 
interest  in  understanding  convergence  rates,  while  (3),  (4)  and  (5)  are  useful  in  assessing 
bottom-line  performance  and  modeling  scalability. 
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The  computer  code  was  first  developed  on  a  network  of  workstations,  and  then  moved 
to  the  IBM  SP2,  changing  only  a  UNIX  makefile.  To  obtain  the  best  performance  of  the 
code,  in  terms  of  either  the  elapsed  time  or  the  megaflop  rate,  is  not  the  main  purpose  of 
this  paper.  We  provide  the  execution  time  and  megaflop  information  for  all  the  calculations 
for  completeness.  Though  compiler  optimization  was  used,  the  listed  megaflop  rates  are  an 
order  of  magnitude  below  their  peak  values.  Greater  attention  to  cacheing  is  undoubtedly 
required  to  improve  this  situation,  and  will  potentially  be  simplified  when  addressed  in  the 
future  by  the  domain-oriented  structure  of  the  software. 

6.1  Test  problem  and  parameter  selection 

fi  is  a  unit-aspect  ratio  square  partitioned  into  a  uniform  rectangular  meshes  up  to  512  x  512 
in  size.  Let  Qoo,  the  farfield  flow  speed,  be  normalized  to  1.  In  Fig.  3,  let  #oo  = 

We  assume  the  following  boundary  conditions. 


•  On  the  farfield  boundaries  r4,  Fs  and  Fe,  we  assume  $  =  $oo- 


•  On  F2, 


-—  =  -V$oo 

oy 


(Uj;, 


where  n  =  inx,ny)  is  the  unit  outward  normal,  and  where  y  =  f{x)  describes  the 
shape  of  airfoil  for  a:  e  F2.  Once  the  function  f{x)  is  given,  this  condition  becomes 


dy 


-9oo/(a:). 


•  On  Fi  and  F3,  we  impose  for  symmetry  the  no  penetration  condition 

^  ^  _  0 
dn  dy' 


The  functional  form  used  for  the  NACA0012  geometry  [38]  is 

f{x)  =  0.17814(v^  -x)  +  0.10128(a;(l  -  x))  -  0.10968a:2(  1  -  x )  +  0.06090x3(1  -  x), 

for  X  €  (0,1).  This  unit  interval  is  scaled  to  (1/3,  2/3)  in  the  overall  domain.  The  blunt 
leading  edge  of  the  airfoil  poses  a  technical  problem  for  the  transpiration  boundary  condi¬ 
tion,  since  f{x)  is  undefined  there,  so  we  slightly  modify  the  function  /(x).  The  curve  in 
the  interval  [0, 0.047059]  is  replaced  by  a  parabola  with  a  matching  function  value  at  x  =  0, 
and  matching  function  and  first  derivative  values  at  x  =  0.047059. 

A  number  of  parameters  need  to  be  specified  in  the  NKS  algorithms.  The  selection  of 
some  parameters,  such  as  the  number  of  subdommns,  is  related  to  the  granularity  of  the 
architecture,  not  to  the  equation,  itself.  Altogether,  we  have 

•  Switching-function  parameters,  in  the  transonic  case  (§3.2).  The  level  of  maximization 
of  the  switching  function  is  set  to  2,  uq  is  1.0,  and  the  cutoff  Mach  value  is  —  0.95. 
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•  Finite  differencing  parameter,  e  (§4.1).  We  find  that  for  the  nondimensional  scalar 
full  potential  equation,  the  numerics  are  not  very  sensitive  to  e.  We  simply  set  it  to 
10~®,  near  the  square  root  of  the  machine  epsilon. 

•  Newton  convergence  parameters  (§4.1).  The  initial  guess  is  a  simple  interpolation  of 
the  farfield  boundary  condition.  Nonlinear  convergence  is  declared  following  a  10~^° 
relative  reduction  of  the  initial  residual.  The  step  size  reduction  ratio  in  the  line 
search  is  0.5  and  the  termination  tolerance  is  10“^. 

•  Krylov  convergence  parameters  (§4.2).  The  convergence  tolerance  for  the  linear  iter¬ 
ative  solver  at  each  Newton  iteration  rjk  =  10"^||F’($fc)||2  .  We  restart  GMRES  at 
every  30th  iteration. 

•  Number  of  subdomains,  ns  (§4.3).  Since  only  the  additive  version  of  Schwarz  is  under 
consideration,  we  always  set  that  the  number  of  subdomains  is  the  same  as  the  number 
of  processors,  np,  which  varies  from  8  (the  minimum  required  to  store  the  problem)  to 
32  (the  maximum  available  within  power-of-two  configurations).  (In  a  multiplicative 
algorithm  [37],  we  would  set  n  to  np  times  the  number  of  colors.) 

•  Overlapping  size,  ovlp  (§4.3).  In  fact,  there  are  two  overlapping  sizes,  in  x  and  y 
directions.  In  this  paper,  we  assume  the  same  number  of  fine  mesh  cells,  ovlp  = 
1, . . . ,  5,  are  extended  in  both  directions. 

•  Coarse  grid  size  (§4.3).  This  varies  from  no  coarse  grid  (0  x  0)  to  a  coarse  grid  with 
a  modest  number  of  points  in  each  subdomain  (10  x  11).  (The  coarse  grid  cells  are 
square,  but  asymmetry  in  the  employment  of  Neumann  boundary  conditions  in  the  x 
and  y  directions  makes  the  total  number  of  gridpoints  off  by  one.) 

•  Level  of  fiU,  k,  in  ILU  (§4.3).  According  to  our  past  experience  with  multilevel  pre¬ 
conditioning  [5]  and  similar  experience  on  a  industrial-grade  transonic  potential  code 
[33],  relatively  modest  fill-in  is  optimal  for  small  subdomains.  Intuitively,  little  is 
lost  relative  to  the  coupling  already  sacrificed  at  subdomain  boundaries.  However, 
as  the  local  memory  keeps  increasing  on  powerful  modern  parallel  computers,  such 
as  the  IBM  SP2,  the  size  of  the  subdomain  problems  can  be  quite  large.  For  large 
subdomain  problems,  low  level  of  fiU-in  is  no  longer  as  effective,  k  varies  from  0  to  5 
in  our  experiments,  then  jumps  discontinuously  to  the  full  band  in  the  case  of  exact 
subdomain  solves. 

6.2  Observations  —  subsonic  case 

Our  first  test  case  corresponds  to  a  subsonic  problem  with  Moo  =  0.1.  The  linear  systems 
that  arise  fall  within  the  elliptic  theory  for  Schwarz  [37].  It  takes  6  Newton  iterations  to 
reduce  the  initial  nonlinear  residual  by  a  factor  of  10~^°.  Because  of  the  Krylov  dimension 
cut-off,  the  convergence  is  linear;  see  the  left  panel  in  Fig.  4.  The  top  portion  of  Table  1 
shows  the  convergence  performance  for  a  fixed- size  problem  of  512  x  512  uniform  cells  with  an 
increasing  number  of  subdomains:  8, 16  and  32.  The  overlap  size  is  fixed  at  3h.  The  density 
of  the  unnested  uniform  coarse  grid  varies  from  0  x  0  to  10  x  11.  Key  observations  from  this 
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NACA0012,  Mach  =  0.10 


Figure  4:  For  Moo  =  the  left  figure  shows  the  history  of  the  Newton  residual,  and  the 
right  shows  the  (upper  surface)  Cp  curve  at  convergence. 


example  are  as  Mows:  (1)  Even  a  modest  coarse  grid  makes  a  significant  improvement  in 
an  additive  Schwarz  preconditioner,  especially  when  the  number  of  subdomains  is  large.  As 
much  as  40%  of  the  execution  time  can  be  saved  when  adding  a  2  x  3  coarse  grid  to  a  no  coarse 
grid  preconditioner,  for  the  32-subdomain  case.  (2)  A  law  of  diminishing  returns  sets  in  at 
roughly  one  point  per  subdomain.  (3)  When  using  8  processors,  the  total  communication 
time  is  always  less  than  5%  of  the  total  computational  time,  however,  it  becomes  as  much 
as  26%  when  using  32  processors.  (This  includes  synchronization  delays  as  weU  as  the  time 
actually  delivering  the  message  packets  from  application  process  to  application  process.) 
Table  2  shows  the  effects  of  the  overlap  size.  For  simplicity,  we  fix  the  coarse  grid  to  6  x  7 
for  aU  test  cases.  The  overlap  size  is  given  here  in  absolute  terms,  i.e.,  the  distance  between 
the  boundary  of  the  unextended  subdomain  and  the  extended  subdomain,  not  relative  to 
the  diameter  of  the  subdomain.  All  the  subproblems  are  solved  with  the  exact  Gaussian 
ehmination  in  sparse  format.  Since  the  fine  mesh  size  is  fixed,  when  using  smaU  number  of 
processors,  such  as  np  =  8,  the  single  processor  memory  requirement  is  substantial.  In  this 
case,  increase  the  overlap  size  can  indeed  reduce  the  total  number  of  GMRES  iterations, 
but  the  reduction  of  the  total  execution  time  is  rather  limited. 

In  Table  3,  we  present  the  results  when  the  subproblems  are  solved  with  ILU(A:)  for 
various  levels  of  fiU-in.  The  overlap  size  is  3/i,  and  the  coarse  grid  is  7  x  8.  The  conclusion 
from  the  tests  shown  is  that  the  larger  the  k,  the  faster  the  method  becomes;  see  the  boxed 
numbers  in  Table  3.  When  using  a  small  number  of  processors,  like  8,  the  best  execution 
time  is  obtained  with  ILU(5);  compare  the  upper  portions  of  Tables  1,  2  and  3,  However,  if 
the  processor  number  is  large,  the  optimal  result  can  only  be  obtained  by  considering  several 
parameters:  ovlp,  k,  the  coarse  mesh  size,  and  perhaps  others.  We  have  not  simultaneously 
varied  aU  relevant  parameters  to  get  the  best  results,  but  have  presented  controUed  slices 
through  parameter  space  for  insight. 

Load  balancing  should  not  be  a  significant  issue  in  the  dedicated-processor  subsonic  case. 
All  processors  have  nearly  the  same  computational  load,  except  those  which  have  to  handle 
the  Neumann  boundary  conditions.  This  is  no  longer  true  for  the  transonic  calculation, 
when  a  shock  resides  in  some  of  the  subdomains.  See  §6.4. 
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Figure  5:  For  Moo  =  0.8,  the  left  figure  shows  the  history  of  the  Newton  residual,  and  the 
right  shows  the  (upper  surface)  Cp  curve  at  convergence. 


NACA0012,  Mach  s  0.80  NACA0012,  iso-Mach  lines,  Mach  number  =  0.80 


Figure  6:  The  left  figure  shows  the  convergence  history  of  the  Cp  curves  at  Moo  =  0.8.  The 
right  figure  shows  the  Mach  contours  of  the  final  solution  at  Moo  =  0.8. 

6.3  Observations  —  transonic  case 

Figure  5  shows  the  convergence  history  and  converged  Cp  for  a  transonic  problem  with 
Moo  =  0.8.  The  first  and  most  important  observation  is  that  without  a  proper  upwinding 
discretization,  all  three  components  of  NKS  can  fail.  Fig.  6  shows  the  convergence  history 
in  terms  of  the  Cp  curves.  We  note  that  it  takes  only  4  to  5  iterations  for  the  Newton’s 
method  to  establish  the  neighborhood  of  the  shock,  but  another  15  or  so  iterations  to  move 
it  to  the  exact  location.  Mach  contours  at  the  final  solution  are  given  in  Fig.  6.  While  the 
shock  is  setting  up,  the  linear  convergence  of  Newton’s  method  is  interrupted;  see  the  left 
panel  of  Fig.  5. 

The  results  for  coarse  grids  of  varying  size  are  summarized  at  the  bottom  part  of  Table  1. 
The  columns  marked  0x0  and  2x3  reveal  an  interesting  result  for  a  mixed  elliptic-hyperbolic 
problem.  The  inclusion  of  a  small  coarse  grid  can  reduce  the  total  number  of  the  linear 
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Figure  7:  The  GMRES  convergence  history  for  the  entire  nonlinear  iteration.  The  left  figure 
does  not  have  a  coarse  space  in  the  Schwarz  preconditioner,  the  right  figure  contains  a  7  x  8 
coarse  space.  The  number  at  the  tail  of  each  curve  corresponds  to  the  Newton  iteration 
number. 

iterations,  as  well  as  the  total  execution  time,  by  a  factor  of  30%.  An  optimally  chosen 
coarse  grid  size  can  lead  to  a  greater  savings.  In  Fig.  7,  we  overlay  the  convergence  histories 
of  all  the  linear  solutions  in  a  complete  nonlinear  calculation.  The  history  in  the  left  panel 
is  without  a  coarse  grid,  and  that  in  the  right  with  a  7  x  8  coarse  grid.  The  corresponding 
execution  time  requirements  can  be  found  in  Table  1. 

The  number  of  linear  iterations  and  the  total  execution  time  can  be  reduced  even  further 
if  a  proper  overlap  size,  which  is  not  usually  very  small,  is  used;  see  Table  2. 

The  best  result,  in  terms  of  the  total  execution  time,  among  all  the  test  calculations  is 
obtained  using  a  ILU(it),  with  A:  =  5,  as  the  subproblems  solver;  see  Table  3.  It  takes  less 
than  2^  minutes  on  the  32-processor  IBM  SP2  to  set  up  and  solve  the  Mach  0.8  nonlinear 
system  with  more  than  a  quarter  of  a  million  unknowns. 

6.4  Parallel  efficiency 

The  parallel  efficiency  of  the  present  algorithm-software-hardware  system  is  encouraging, 
but  it  is  useful  to  sort  out  in  detail  where  efficiency  is  lost  in  going  from  8  to  32  processors. 
We  display  the  parallel  performance  in  Table  4,  whose  first  three  columns  are  excerpted 
from  the  first  and  last  columns  of  Table  3.  The  execution  time  data  in  the  last  column 
of  Table  3  is  the  best,  or  nearly  the  best,  for  each  Mach  number  and  parallel  granularity 
out  of  all  of  the  parameter  combinations  considered,  and  is  therefore  the  most  meaningful 
from  which  to  draw  parallel  efficiency  conclusions,  though  more  flattering  conclusions  could 
be  drawn  from  runs  that  were  performing  more  computation  per  node  per  communication 
exchange. 

After  the  number  of  processors,  we  list  the  number  of  linear  GMRES  iterations  per  6 
Newton  steps  in  the  upper  (subsonic)  half  of  the  table,  and  per  19  Newton  steps  in  the 
lower  (transonic)  half  of  the  table.  Then  we  list  the  execution  time,  per  6  or  19  Newton 
steps,  respectively,  these  being  the  typical  number  of  Newton  steps  required  to  fully  solve 
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the  nonlinear  problem.  Fig.  7  shows  that,  for  the  transonic  case,  the  number  of  GMRES 
steps  can  vary  significantly  over  the  course  of  a  complete  set  of  Newton  iterations,  but  the 
mean  and  the  median  are  close. 

Consider  the  following  idealized  model  for  the  execution  time  of  the  fixed-size  problem 
on  p  processors.  Let  T{p)  be  the  overall  execution  time,  /(p)  the  number  of  linear  iterations, 
and  C'(p)  the  average  cost  per  iteration.  (We  note  that  in  GMRES,  the  average  cost  per 
iteration  is  not  independent  of  the  number  of  iterations,  because  of  the  orthogonalization 
overhead  of  later  Krylov  vectors  is  greater  than  earlier  vectors,  but  we  assume  that  the 
dominant  cost  per  iteration  is  the  parallel  Schwarz  preconditioning.) 

The  overall  parallel  efficiency  is  defined  as  p(p)  =  where  T(p)  =  /(p)  •  C'(p). 

Since  we  lack  results  for  p  =  1  on  this  problem  of  industrial  size  (-512  x  512),  we  replace 
aU  efficiencies  by  relative  efficiencies  with  respect  to  the  minimum  configuration  of  pi  =  8. 
The  overall  relative  parallel  efficiency  is  therefore  defined  as  T](pi  —>■  p)  =  •  The 

numerical  efficiency,  a  measure  of  the  robustness  of  the  preconditioning  with  respect  to 
increasing  granularity,  is  PnumeriPi  p)  =  The  implementation  efficiency  is  the 

remaining  factor,  T]ijnpi{pi  p)  =  ^p.clp)^ ’  that  p(pi  p)  =  flnumerip)  ^  Vimpl{p)- 

The  numerical  efficiency  is  nearly  90%  or  above  for  all  cases  —  that  is,  the  convergence 
rate  of  the  preconditioned  linear  system  hardly  degrades  with  increasing  parallel  granu¬ 
larity.  Approximately  13  GMRES  steps  are  required  for  each  subsonic  Jacobian  system 
and  approximately  23  GMRES  steps  for  each  transonic  Jacobian  system.  This  insensitivity 
to  granularity  for  a  multilevel  preconditioned  operator  is  predicted  by  the  Schwarz  theory 
for  the  subsonic  case,  and  seems  to  be  a  fortunate  consequence  of  the  relatively  confined 
supersonic  pocket  of  flow  in  the  transonic  case. 

The  implementation  efficiency  accounts  for  the  most  significant  factor  of  overall  efficiency 
decline.  The  difference  between  the  subsonic  and  transonic  implementation  efficiencies  at 
high  granularity  can  be  attributed  to  load  imbalance,  since  the  cells  requiring  upwinding  are 
concentrated  into  a  small  number  of  processors.  (A  more  sophisticated  dynamic  mapping 
algorithm  could  address  this  problem,  but  this  is  beyond  present  scope.)  The  subsonic 
degradation  of  76%  in  going  from  4  to  32  nodes  is  identified  as  the  chief  remaining  loss. 
Redundant  work  and  higher  communication-to-computation  ratio  in  the  overlap  regions, 
which  account  for  a  steadily  increasing  fraction  of  all  points  in  a  fixed-size  problem  explain 
the  majority  of  this  efficiency  loss,  which  would  disappear  in  a  scaled  problem  with  fixed-size 
subdomains  on  each  processor. 

6.5  Sequential  comparison  with  global  IL\J(k)  preconditioners 

The  results  of  this  section  establish  Schwarz  preconditioning  as  numerically  attractive  and 
reasonably  parallel  efficient,  but  it  is  natural  to  ask  whether  its  utility  is  limited  to  dis¬ 
tributed  memory  implementations  of  Newton-Krylov  methods.  To  satisfy  curiosity  on  this 
point,  we  conclude  with  tests  of  Schwarz  preconditioning  against  the  popular  global  ILU(A;), 
k  =  0,..  .5,  family  of  preconditioners  on  a  non-dedicated  singlerprocessor  SUN  SPARCsta- 
tion  with  512MB  of  memory.  The  results  are  summarized  in  Table  5.  Because  of  the  overlap 
and  the  coarse  solve,  the  Schwarz  preconditioner  needs  more  memory,  even  if  all  subdomain 
problems  are  solved  inexactly  with  ILU(5),  than  the  other  global  ILU(/;)  preconditioners. 
On  the  other  hand,  Schwarz  outperforms  aU  the  global  solvers  in  terms  of  total  GMRES 
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iteration  count  and  the  total  execution  tinae.  Part  of  the  reason  for  the  fine  performance  of 
the  Schwarz  method  is  the  much  higher  uni-processor  Megafiop  rating,  which  is  presumably 
related  to  much  improved  cache  locality. 


7  Conclusions 

We  have  investigated  computationally  the  effectiveness  of  Newton-Krylov-Schwarz  methods 
applied  to  the  full  potential  equation  of  aerodynamics  in  some  simplified  situations  in  two 
space  dimensions.  Best  performance  is  obtained  with  modest  overlap,  a  modest  coarse  grid 
(one  or  two  points  per  processor),  modest-to-generous  fill  in  the  subdomain  ILU  precon¬ 
ditioners,  and  uniformly  loose  convergence  tolerances  on  the  Krylov  iterations  within  each 
Newton  step.  For  subsonic  problems,  the  theoretically  expected  performance  of  the  method 
is  essentially  achieved.  For  the  transonic  case,  the  numerics  are  more  encouraging  than  ex¬ 
isting  theory.  Overall  computation  time  is  approximately  six  times  greater  for  the  transonic 
than  for  the  subsonic  case,  with  current  upwinding  strategies.  This  can  be  factored  into 
a  three-fold  increase  in  the  number  of  Newton  steps  in  the  transonic  case,  and  a  two-fold 
increase  in  the  number  of  Krylov  iterations  per  Newton  step. 

Two  strategies  that  should  be  employed  on  more  nonlinearly  taxing  problems  that  we 
have  not  considered  here  are  mesh  sequencing  and  pseudo-transient  continuation.  Their 
purpose  is  to  deliver  an  initial  iterate  for  the  steady-state  form  of  Newton  s  method  employed 
in  this  paper  that  is  already  in  the  local  domain  of  convergence  on  the  finest  grid.  (Observe, 
for  instance,  that  the  number  of  Newton  steps  required  on  Moo  =  0.8  problem  the  256  x  256 
grid  in  Table  5,  is  roughly  half  that  of  the  corresponding  problem  on  the  512  x  512  grid  in 
Table  .3.  If  the  shock  is  correctly  located  on  a  (relatively)  coarse  grid,  the  plateau  of  Fig.  5 
will  be  diminished  on  a  finer  grid  that  is  initialized  from  the  coarse  grid  solution.)  Our  rapid 
turnaround  times  for  two-dimensional  problems  artificially  deemphasize  the  importance 
of  these  strategies  in  large,  complex  nonlinear  problems.  In  addition  to  globalizing  the 
Newton  convergence,  continuation  strategies  tend  to  improve  the  linear  conditioning  of  the 
intermediate  problems,  and  are  therefore  potentially  useful  even  in  problems  (such  as  ours) 
for  which  simple  initial  guesses  on  the  finest  grid  lead  to  convergence. 

The  broadest  motivation  for  Newton-Krylov-Schwarz  methods  is  the  need  to  solve  large- 
scale  problems  with  complex  discretizations  on  distributed-memory  systems  with  limited 
memory  per  node.  The  matrix-free  aspect  of  Newton  permits  shortcuts  in  Jacobian  for¬ 
mation  storage  while  the  domain  decomposition  aspect  of  Schwarz  leads  to  load-balanced 
data-to-memory  maps  that  render  communication  subdominant  in  the  preconditioning.  The 
amount  of  work  done  in  the  Krylov  iteration  can  be  adjusted  to  produce  an  overall  method 
with  the  best  balance  between  the  nested  components. 
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Table  1;  Varying  the  coarse  grid  size.  Fine  mesh  512  x  512,  Moo  —  O-l  0.8,  sparse 
LU  for  all  subproblems,  ovlp  =  Zh.  “Newton”  is  the  total  number  of  Newton  iterations. 
“GMRES”  is  the  total  number  of  GMRES  iterations  occur  in  all  of  the  Newton  iterations. 
“EXEC”  is  the  execution  time  per  processor  in  seconds  for  the  entire  calculation.  “COMM” 
is  the  total  communication  time  per  processor  in  seconds. 


np 

Coarse  Grid 

0x0 

2x3 

4x5 

6x7 

8x9 

10  x  11 

Moo 

=  0.1 

Newton 

6 

6 

6 

6 

6 

6 

GMRES 

144 

81 

59 

53 

50 

51 

8 

EXEC 

136.79 

125.30 

104.18 

97.28 

94.18 

94.63 

COMM  ^ 

2.14 

2.10 

1.95 

2.39 

3.28 

4.24 

Mflop/s 

183.59 

157.02 

170.67 

177.63 

180.94 

180.93 

Newton 

6 

6 

6 

6 

6 

6 

GMRES 

167 

92 

66 

54 

54 

54 

16 

EXEC 

72.18 

50.34 

42.10 

39.18 

40.95 

42.32 

COMM 

2.25 

2.38 

2.37 

2.88 

3.97 

5.48 

Mflop/s 

295. i9 

266.83 

277.39 

280.01 

272.65 

265.44 

Newton 

6 

6 

6 

6 

6 

6 

GMRES 

227 

105 

72 

64 

53 

57 

32 

EXEC 

47.94 

29.55 

24.32 

24.46 

23.43 

26.59 

COMM 

3.36 

2.92 

2.95 

4.13 

4.51 

7.00 

Mflop/s 

498.04 

477.99 

463.72 

440.64 

423.16 

390.33 

Moo  =  0.8  • 


Newton 

20 

19 

19 

20 

19 

19 

GMRES 

814 

435 

359 

350 

307 

311 

8 

EXEC 

757.45 

548.52 

490.64 

498.08 

458.75 

466.20 

COMM 

10.97 

11.63 

12.16 

15.68 

19.07 

24.52 

Mflop/s 

158.70 

158.47 

162.20 

163.44 

163.00 

161.15 

Newton 

19 

19 

20 

20 

19 

19 

GMRES 

849 

483 

398 

349 

320 

330 

16 

EXEC 

432.78 

327.77 

306.25 

291.14 

277.06 

289.89 

COMM 

11.69 

12.90 

14.14 

18.43 

23.26 

32.79 

Mflop/s  1 

233.50 

224.48 

219.17 

215.38 

210.78 

204.68 

Newton 

19 

19 

19 

19 

20 

20 

GMRES 

1142 

636 

454 

387 

391 

385 

32 

EXEC 

333.79 

244.36 

207.54 

200.21 

215.46 

227.84 

COMM 

16.09 

18.43 

19.40 

25.26 

.34.46 

47.68 

Mflop/s 

360.18 

333.53 

317.77 

298.29 

284.96 

267.36 
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Table  2:  Varying  the  overlapping  size  ovlp.  Fine  mesh  512  x  512,  Moo  =  0.1  and  0.8. 
Exact  LU  for  all  subproblems.  Coarse  grid  size  6x7.  “Newton”  is  the  total  number  of 
Newton  iterations.  “GMRES”  is  the  total  number  of  GMRES  iterations  occur  in  all  of  the 
Newton  iterations.  “EXEC”  is  the  execution  time  per  processor  in  seconds  for  the  entire 
calculation.  “COMM”  is  the  total  communication  time  per  processor  in  seconds. 


np 

ovlp  =  Ih 

ovlp  =  2h 

ovlp  = 

3/i 

II 

o 

ovlp  = 

bh 

Moo  =  0.1 

Newton 

6 

6 

6 

6 

6 

GMRES 

77 

60 

53 

50 

50 

8 

EXEC 

101.12 

97:31 

98.31 

100.50 

COMM 

3.22 

2.59 

2.46 

2.36 

2.59 

Mflop/s 

161.03 

174.33 

177.61 

186.71 

Newton 

6 

6 

6 

6 

6 

GMRES 

82 

61 

54 

49 

49 

16 

EXEC 

51.77 

45.46 

44.05 

44.42 

45.21 

COMM 

3.87 

3.07 

2.95 

2.77 

2.64 

Mflop/s 

267.85 

276.36 

279.31 

277.96 

284.54 

Newton 

6 

6 

6 

6 

6 

GMRES 

93 

64 

57 

52 

32 

EXEC 

30.35 

26.75 

26.31 

mm 

COMM 

5.42 

4.24 

3.64 

3.58 

3.19 

Mflop/s 

387.34 

421.17 

460.17 

Moo  =  0.8 

Newton 

19 

19 

20 

20 

20 

GMRES 

435 

365 

319 

315 

8 

498.78 

491.47 

492.52 

COMM 

17.66 

15.87 

16.31 

15.86 

16.49 

Mflop/s 

145.19 

163.11 

170.98 

172.56 

Newton 

19 

19 

mm 

20 

19 

GMRES 

462 

377 

349 

324 

16 

EXEC 

315.39 

294.34 

292.02 

280.61 

278.21 

COMM 

22.74 

19.22 

18.70 

17.10 

Mflop/s 

216.43 

211.85 

196.48 

221.34 

219.24 

Newton 

19 

19 

19 

19 

19 

GMRES 

627 

445 

387 

354 

344 

32 

EXEC 

251.96 

207.12 

199.90 

188.99 

189.44 

COMM 

.38.08 

24.12 

23.02 

22.31 

Mflop/s 

276.02 

298.73 

313.36 
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Table  3:  Varying  the  level  of  ILU(A;)  fill-in.  Fine  mesh  512  x  512,  Moo  =  0-1  and 
0.8.  Coarse  grid  is  7  x  8.  ovlp  =  Zh.  “Newton”  is  the  total  number  of  Newton  iterations. 
“GMRES”  is  the  total  number  of  GMRES  iterations  occur  in  all  of  the  Newton  iterations. 
“EXEC”  is  the  execution  time  per  processor  in  seconds  for  the  entire  calculation.  “COMM” 
is  the  total  communication  time  per  processor  in  seconds. 


np 

ILU(ifc) 

k  =  0 

ib  =  1 

k  =  2 

k  =  Z 

II 

II 

Moo 

=  0.1 

Newton 

6 

6 

6 

6 

6 

6 

GMRES 

307 

175 

127 

98 

84 

75 

8 

EXEC 

171.46 

110.07 

88.26 

76.11 

74.28 

71.74 

c 

COMM 

14.08 

8.10 

5.87 

4.54 

4.44 

3.62 

Mflop/s 

119.80 

113.63 

109.46 

102.36 

96.00 

95.43 

Newton 

6 

6 

6 

6 

6 

6 

GMRES 

299 

178 

129 

101 

87 

78 

16 

EXEC 

97.82 

64.14 

51.19 

43.34 

40.81 

40.07 

□ 

COMM 

16.73 

9.42 

6.85 

5.38 

4.80 

3.88 

Mflop/s 

208.04 

201.60 

195.17 

188.34 

183.56 

179.69 

Newton 

6 

6 

6 

6 

6 

6 

GMRES 

298 

179 

130 

104 

90 

82 

32 

EXEC 

64.34 

41.92 

32.73 

28.18 

26.34 

25.57 

□ 

COMM 

19.26 

11.44 

8.14 

6.63 

5.70 

4.91 

Mflop/s 

324.76 

321.72 

318.07 

308.78 

304.06 

305.19 

Moo 

=  0.8 

Newton 

19 

20 

18 

19 

19 

19 

GMRES 

1638 

911 

622 

547 

462 

424 

8 

EXEC 

863.15 

523.33 

382.27 

364.11 

335.47 

330.76 

L 

COMM 

77.30 

42.64 

29.13 

25.90 

21.62 

19.81 

Mflop/s 

127.49 

125.37 

124.04 

123.70 

122.13 

119.90 

Newton 

19 

20 

20 

19 

19 

19 

GMRES 

1693 

924 

707 

531 

460 

423 

16 

EXEC 

526.40 

316.98 

255.54 

209.27 

195.96 

192.16 

L 

COMM 

90.88 

49.9 

38.32 

28.55 

25.14 

22.46 

Mflop/s 

220.22 

213.73 

215.42 

214.21 

210.64 

208.59 

Newton 

19 

20 

20 

19 

20 

20 

GMRES 

1746 

961 

799 

596 

557 

501 

32 

EXEC 

361.81 

217.29 

187.37 

150.59 

150.29 

145.12 

L 

COMM 

110.89 

61.46 

51.32 

38.02 

35.60 

32.07 

Mflop/s 

342.40 

336.48 

342.84 

341.52 

343.77 

340.06 

_ 
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Table  4:  Parallel  efficiency.  Fine  mesh  512  x  512,  Moo  =  0.1  and  0.8.  Coarse  grid  is 
7  X  8.  ovlp  =  3h.  “GMRES”  and  “EXEC”  are  the  total  number  of  GMRES  iterations  and 
seconds  of  execution  time  per  6  Newton  steps  in  the  upper  half  of  the  table,  and  per  19 
Newton  steps  in  the  lower  half. 


up 

GMRES 

EXEC 

'Qnumer 

Vimpl 

V 

Moo-- 

=  0.1 

8 

75 

71.1 

~ 

- 

- 

16 

78 

40.1 

0.961 

0.931 

0.895 

32 

82 

25.6 

0.915 

0.766 

0.701 

^OO  ~ 

=  0.8 

8 

424 

330.1 

- 

- 

- 

16 

423 

192.2 

1.002 

0.860 

0.861 

32 

475 

137.9 

0.891 

0.674 

0.601 

Table  5:  Sequential  comparison  of  the  additive  Schwarz  preconditioner (OSM)  with  the 
global  ILU(A;),  A:  =  0, . .  .,5,  preconditioners  on  a  single  processor  Sun  workstation.  The  fine 
mesh  is  256  x  256.  The  specifications  of  OSM  are:  8  subdomains,  3/i  overlap,  7x8  coarse 
grid,  and  ILU(5)  as  the  subdomain  solver.  MEM  is  the  total  memory  needed  to  store  the 
preconditioning  matrix  in  Megabytes. 


OSM 

ILU(O) 

ILU(l) 

ILy(2) 

ILU(3) 

ILU(4) 

ILU(5) 

Moo  = 

0.1 

Newton 

6 

6 

6 

6 

6 

6 

6 

GMRES 

48 

509 

280 

195 

148 

119 

103 

EXEC 

n 

496.29 

2754.54 

1599.84 

1178.55 

970.72 

817.42 

771.04 

MEM(MB) 

24.22 

10.58 

10.57 

13.65 

16.83 

20.12 

22.87 

Mflop/s 

32.12 

6.68 

6.48 

6.19 

6.04 

5.85 

5.40 

Mco  = 

0.8 

Newton 

11 

12 

12 

11 

11 

11 

11 

GMRES 

136 

1391 

666 

464 

328 

268 

217 

EXEC 

1240.18 

7766.42 

3826.05 

2848.73 

2278.81 

1761.89 

1526.82 

MEM(MB) 

24.40 

10.64 

10.63 

13.75 

16.95 

20.26 

23.06 

Mflop/s 

36.-57 

6.53 

6.48 

6.20 

5.63 

5.99 

5.79 
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